In this session we will learn how to build a simple linear model and interpret the results. Let's look at the women dataset.
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
We usually think of the x-axis variable as the independent or explanatory variable: a change in height has an effect on the weight of a woman. The line of best fit can be used to predict what the weight will be for a given height, and to ask how much of the change in the output variable (weight) can be explained by the input variable (height).
One can imagine that there is a relationship between the height of a woman and her weight (taller women are likely to be heavier). A quick look at the data confirms this assumption.
library(tidyverse) # for ggplot2, forcats and the pipe
library(plotly)    # for ggplotly()

g<-women %>%
ggplot(aes(height,weight))+
geom_point(size=3,alpha=.5)+
geom_smooth(method = lm,se=F)+
theme_bw()+
labs(title = "weight explained by height in women",
x="height (explanatory or independent variable)",
y="weight (response or dependent variable)")
ggplotly(g)
## `geom_smooth()` using formula = 'y ~ x'
Each point in the plot above represents a single height/weight pair. Taken together, however, these points create a pattern: as height increases, so does weight (seemingly in a linear fashion). The blue line represents the line of best fit. It is our model. The distance between each observed point and the model is called the “residual”.
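The residuals can be computed directly with resid(). A minimal sketch, fitting the same model that is built below:

```r
# Fit the model and look at the residuals (observed minus fitted values)
fit <- lm(weight ~ height, data = women)
round(resid(fit), 2)
```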
A linear model with this data can be used in two ways:
Determine how much of the change in weight can be explained by a change in height
Predict the weight of a woman, given her height (even for heights not included in the original data)
To create a model we simply use the lm() function. Once we have a model, we can take a look at the details within by using the summary() function. Let's take a look.
model1<- lm(weight~height, data=women)
summary(model1)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
There are some cases where the intercept is meaningless. For example, here the intercept is negative, and the weight of a person can't be negative (a height of zero is also far outside the data). For every one-unit increase in height, the weight goes up by 3.45; this relationship is significant, with a p-value of 1.09e-14, which is less than 0.05. The F-statistic's p-value is also less than 0.05, meaning that at least one of the independent variables helps explain the outcome. Adjusted R-squared becomes important when looking at multiple regression; here, the multiple R-squared of 0.991 means that 99.1% of the change in weight can be explained by the change in height.
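These numbers can also be pulled out of the model object programmatically rather than read off the printout; a quick sketch:

```r
# Extract the key quantities from the fitted model
coef(model1)               # intercept and slope
summary(model1)$r.squared  # proportion of variance explained
confint(model1)            # 95% confidence intervals for the coefficients
```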
We can use this model to predict the weight of a woman given her height.
new_data<-data.frame(height=60)
predict(model1,new_data)
## 1
## 119.4833
If a woman has a height of 60, then her predicted weight is 119.48.
We can also predict a series of weights as follows:
new_data<-data.frame(height=c(30,40,70))
round(predict(model1,new_data),digits = 1)
## 1 2 3
## 16.0 50.5 154.0
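Note that heights of 30 and 40 are far below the range of the original data (58 to 72), so those two predictions are extrapolations and shouldn't be trusted. Within the data range, predict() can also report uncertainty; a quick sketch using a 95% prediction interval:

```r
# Point predictions with 95% prediction intervals
new_data <- data.frame(height = c(60, 65, 70))
round(predict(model1, new_data, interval = "prediction"), 1)
```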
Multiple regression is used to predict outcome variable values based on two or more explanatory variables. The relationship between each explanatory variable and the outcome variable is examined while holding the other variables constant. If necessary, interaction effects can be included to understand how the relationship between variables might change in combination with each other.
Adding another numeric explanatory variable works in exactly the same way. Let's take a look at the trees dataset.
g<-trees %>%
ggplot(aes(Girth,Volume,color=Height))+
geom_point()+
geom_smooth(method = lm, se=F)+
theme_bw()+
labs(title = "Tree volume explained by girth and height")
ggplotly(g)
Now let's create our model and use summary() to take a closer look at these relationships.
lm(Volume~Girth+Height, data=trees) %>%
summary()
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
The intercept for this model is meaningless, since the volume of a tree can't be negative. For every one-inch increase in girth, volume increases by 4.7082 cubic feet, and for every one-foot increase in height, volume increases by 0.3393 cubic feet, holding the other variable constant. These results are statistically significant, since their p-values (< 2e-16 and 0.0145) are less than 0.05.
The p-value (< 2.2e-16) of the F-statistic is less than 0.05, meaning that at least one of the independent variables helps explain the dependent variable. The adjusted R-squared means that 94.4% of the change in volume can be explained by the combination of changes in girth and height; the multiple R-squared means the model explains 94.8% of the change in volume of the trees.
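Confidence intervals give a sense of the uncertainty around these coefficient estimates; a quick sketch, refitting the same model under the hypothetical name model_trees:

```r
# 95% confidence intervals for the girth and height coefficients
model_trees <- lm(Volume ~ Girth + Height, data = trees)
round(confint(model_trees), 3)
```

Neither interval crosses zero, which matches the significant p-values in the summary.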
Let's start by looking at the mpg dataset and the relationship between engine size (displ) and fuel efficiency while driving on the highway (hwy).
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
q<-mpg %>%
ggplot(aes(x=displ, y=hwy))+
geom_jitter()+
geom_smooth(method=lm,se=F)+
theme_bw()+
labs(title = "Highway fuel efficiency explained by engine size",
x="Engine size", y="Highway fuel efficiency")
ggplotly(q)
The relationship that you see shows that cars with bigger engines are less fuel efficient. Let's look at the simple linear model before adding a new variable.
lm(hwy~displ,data=mpg) %>%
summary()
##
## Call:
## lm(formula = hwy ~ displ, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1039 -2.1646 -0.2242 2.0589 15.0105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6977 0.7204 49.55 <2e-16 ***
## displ -3.5306 0.1945 -18.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
## F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
Here we ignore the intercept, since it says that when engine size is zero, fuel efficiency is 35.6977, which is meaningless. Considering engine size alone as the explanatory variable, for every one-litre increase in engine size, fuel efficiency decreases by 3.53 mpg. We can also see that change in engine size explains 58.7% of the change in fuel efficiency, and the p-value is small, meaning the relationship is statistically significant.
We might expect there to be a difference between four-wheel-drive cars and two-wheel-drive cars in terms of fuel efficiency, so we might want to build that into our model.
Let's start by visualizing the data. In this case we shall combine front- and rear-wheel-drive cars into a new category called 2, as both are two-wheel-drive cars.
l<-mpg %>%
mutate(drv=fct_recode(drv,'2'='f','2'='r' )) %>%
ggplot(aes(displ,hwy,color=drv))+
geom_point()+
geom_smooth(method = lm, se=F)+
theme_bw()+
labs(title = "Highway fuel efficiency explained by engine size",
x="Engine size", y="Highway fuel efficiency")
ggplotly(l)
The plot shows that four-wheel-drive and two-wheel-drive cars have a similar relationship between engine size and fuel efficiency: in both cases, cars with bigger engines are less fuel efficient, but two-wheel-drive cars seem to do better. It seems that if we move from a four-wheel-drive car to a two-wheel-drive car, we gain a certain amount of fuel efficiency. We can also note that the slopes for the four- and two-wheel-drive cars are very similar, meaning that the way engine size affects fuel efficiency is consistent regardless of drive type.
mpg %>%
mutate(drv=fct_recode(drv,'2'='f','2'='r' )) %>%
lm(hwy~displ+drv,data=.) %>%
summary()
##
## Call:
## lm(formula = hwy ~ displ + drv, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1064 -1.8738 -0.3735 1.5694 13.9163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.5327 0.7346 41.56 <2e-16 ***
## displ -2.8409 0.1674 -16.97 <2e-16 ***
## drv2 4.9486 0.4347 11.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.077 on 231 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.733
## F-statistic: 320.9 on 2 and 231 DF, p-value: < 2.2e-16
From the results, as engine size increases by one litre, fuel efficiency decreases by 2.84 mpg, while changing from a four-wheel-drive car to a two-wheel-drive car gains 4.9486 mpg; both effects are statistically significant. There is no separate coefficient for four-wheel drive because it is the baseline level, absorbed into the intercept. Since this is a multiple regression, we look at the adjusted R-squared: 73.3% of the change in highway fuel efficiency can be explained by engine size and drive type. The p-value associated with the F-statistic is small, so we can be confident that the relationships we are seeing are not due to chance: the overall model has statistically significant explanatory power.
mpg %>%
mutate(drv=fct_recode(drv,'2'='f','2'='r' )) %>%
lm(hwy~displ*drv,data=.) %>%
summary()
##
## Call:
## lm(formula = hwy ~ displ * drv, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1123 -1.8963 -0.3515 1.5467 13.9443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.6831 1.1140 27.543 < 2e-16 ***
## displ -2.8785 0.2681 -10.738 < 2e-16 ***
## drv2 4.7242 1.3213 3.575 0.000426 ***
## displ:drv2 0.0618 0.3436 0.180 0.857436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.083 on 230 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.7319
## F-statistic: 213 on 3 and 230 DF, p-value: < 2.2e-16
The results show that there is no significant interaction between engine size and drive type, since the interaction p-value (0.857436) is greater than 0.05. This matches the plot, where you can see that the two lines are nearly parallel.
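A formal way to check whether the interaction term earns its place is to compare the additive and interaction models with anova(). A minimal sketch (recoding drv in base R rather than with forcats):

```r
library(ggplot2)  # for the mpg dataset

# Recode drive: front ("f") and rear ("r") wheel drive both become "2"
mpg2 <- mpg
mpg2$drv <- ifelse(mpg2$drv == "4", "4", "2")

m_add <- lm(hwy ~ displ + drv, data = mpg2)
m_int <- lm(hwy ~ displ * drv, data = mpg2)
anova(m_add, m_int)  # a large p-value means the interaction adds little
```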
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
p2<-penguins %>%
ggplot(aes(bill_depth_mm,bill_length_mm))+
geom_point()+
geom_smooth(method=lm, se=F)+
theme_bw()+
labs(title = "penguin bill length explained by bill depth",
x="bill depth",
y="bill length")
ggplotly(p2)
The relationship seems to be that as bill depth goes up, bill length goes down, but this doesn't provide a clear picture, so we can dig further with multiple regression.
p3<-penguins %>%
ggplot(aes(bill_depth_mm,bill_length_mm,color=species))+
geom_point(alpha=.7)+
geom_smooth(aes(color=species),method=lm, se=F)+
theme_bw()+
labs(title = "penguin bill length explained by bill depth",
x="bill depth",
y="bill length")
ggplotly(p3)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
We look at the relationship between bill length and bill depth within each species. Notice that within each of the colored clusters above, the effect of species is removed: within a cluster there is only one species to consider, so there can't be any effect that comes from moving from one species to the next.
lm(bill_length_mm~bill_depth_mm+species, data = penguins) %>%
summary()
##
## Call:
## lm(formula = bill_length_mm ~ bill_depth_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0300 -1.5828 0.0733 1.6925 10.0313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.2164 2.2475 5.88 9.83e-09 ***
## bill_depth_mm 1.3940 0.1220 11.43 < 2e-16 ***
## speciesChinstrap 9.9390 0.3678 27.02 < 2e-16 ***
## speciesGentoo 13.4033 0.5118 26.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.518 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7892, Adjusted R-squared: 0.7874
## F-statistic: 421.9 on 3 and 338 DF, p-value: < 2.2e-16
The p-value for the F-statistic is very small, therefore the overall model has explanatory power. In this case the intercept isn't meaningful, since a bill depth of 0 mm is not realistic. 78.7% of the change in bill length can be explained by the combination of species and bill depth. Every one-millimetre increase in bill depth corresponds to a 1.39 mm increase in bill length, holding species constant. Holding bill depth constant, moving from Adelie to Chinstrap corresponds to a 9.94 mm increase in bill length, and moving from Adelie to Gentoo a 13.40 mm increase.
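As before, the model can be used for prediction; a quick sketch predicting bill length at a hypothetical bill depth of 18 mm for each species:

```r
library(palmerpenguins)

# Refit the same model and predict for new data
model_pen <- lm(bill_length_mm ~ bill_depth_mm + species, data = penguins)
new_pen <- data.frame(bill_depth_mm = 18,
                      species = c("Adelie", "Chinstrap", "Gentoo"))
round(predict(model_pen, new_pen), 1)
```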
Let's look at a simple linear model that considers the relationship between the weight of cars and fuel efficiency in the mtcars dataset.
mtcars %>%
ggplot(aes(wt,mpg))+
geom_point()+
geom_smooth(method = lm, se=F)+
theme_bw()+
labs(title = 'fuel efficiency explained by weight of cars', x= 'weight of cars', y='fuel efficiency')
## `geom_smooth()` using formula = 'y ~ x'
lm(mpg~wt, data=mtcars) %>%
summary()
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The results show that for every 1000 lb increase in weight, fuel efficiency decreases by 5.34 mpg.
mtcars %>%
mutate(am=as.factor(am)) %>%
mutate(am=fct_recode(am, 'automatic'='0', 'manual'='1')) %>%
ggplot(aes(wt,mpg,color=am))+
geom_jitter()+
geom_smooth(method = lm, se=F)+
theme_bw()+
labs(title = 'fuel efficiency explained by weight of cars')
## `geom_smooth()` using formula = 'y ~ x'
mtcars %>%
mutate(am=as.factor(am)) %>%
mutate(am=fct_recode(am, 'automatic'='0', 'manual'='1')) %>%
lm(mpg~wt*am, data=.) %>%
summary()
##
## Call:
## lm(formula = mpg ~ wt * am, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6004 -1.5446 -0.5325 0.9012 6.0909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.4161 3.0201 10.402 4.00e-11 ***
## wt -3.7859 0.7856 -4.819 4.55e-05 ***
## ammanual 14.8784 4.2640 3.489 0.00162 **
## wt:ammanual -5.2984 1.4447 -3.667 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
## F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
The results show that the model has predictive power, since the F-statistic is statistically significant, and the model explains 83.3% of the change in fuel efficiency; in this case the intercept is meaningless. Moving from an automatic to a manual transmission increases fuel efficiency by 14.87 mpg, assuming weight is kept constant. The interaction between weight and manual transmission is also statistically significant, since its p-value is less than 0.05: for manual cars, each additional 1000 lbs of weight costs an extra 5.3 mpg on top of the baseline weight effect. The plot also shows the interaction, since the lines cross each other.
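The group-specific slopes can be read off the coefficients: the automatic slope is the wt coefficient, and the manual slope adds the interaction term. A quick sketch:

```r
# Refit the model and compute the weight slope for each transmission type
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
m <- lm(mpg ~ wt * am, data = cars)
co <- coef(m)
c(automatic = unname(co["wt"]),
  manual    = unname(co["wt"] + co["wt:ammanual"]))  # manual slope is steeper
```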
Let's consider an interaction with a numeric variable; in this case we will look at horsepower.
mtcars %>%
lm(mpg~wt*hp, data=.) %>%
summary()
##
## Call:
## lm(formula = mpg ~ wt * hp, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -1.6491 -0.7362 1.4211 4.5513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
## wt -8.21662 1.26971 -6.471 5.20e-07 ***
## hp -0.12010 0.02470 -4.863 4.04e-05 ***
## wt:hp 0.02785 0.00742 3.753 0.000811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
## F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
The interaction between weight and horsepower is statistically significant (p-value less than 0.05), and the resulting model explains 87.2% of the change in fuel efficiency (adjusted R-squared).
The results show that, at zero of the other variable, every unit increase in weight and in horsepower decreases fuel efficiency by 8.22 and 0.12 mpg respectively.
The positive interaction term means that for heavier cars, the fuel-efficiency penalty per additional horsepower is smaller than for lighter cars.
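This can be seen by computing the marginal effect of horsepower at different weights: it is the hp coefficient plus the interaction coefficient times weight. A quick sketch:

```r
m <- lm(mpg ~ wt * hp, data = mtcars)
co <- coef(m)
wts <- c(2, 3, 4)  # car weights in 1000 lbs
# mpg change per extra horsepower, at each weight
round(setNames(co["hp"] + co["wt:hp"] * wts, paste0(wts, "k lb")), 3)
```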
These are the steps to follow when selecting variables for a model.
Understand your data: look at the variables in the dataset and see what they represent. Check the data for missing values, and explore the data using summaries and visualizations to get a feel for the relationships and distributions.
Define your objectives: be clear on what you are trying to predict or explain with the model, and consider which factors are likely to influence the predicted variable.
Univariate analysis: examine each potential predictor's individual relationship with the outcome variable, via correlation coefficients for continuous variables or comparisons of means for categorical variables.
Multivariate analysis: check for multicollinearity, where the predictors are highly correlated with each other.

### Example
Let's take a look at the swiss dataset.
glimpse(swiss)
## Rows: 47
## Columns: 6
## $ Fertility <dbl> 80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92.4, 82.4,…
## $ Agriculture <dbl> 17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67.8, 53.3,…
## $ Examination <int> 15, 6, 5, 12, 17, 9, 16, 14, 12, 16, 14, 21, 14, 19, …
## $ Education <int> 12, 9, 5, 7, 15, 7, 7, 8, 7, 13, 6, 12, 7, 12, 5, 2, …
## $ Catholic <dbl> 9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16,…
## $ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3, 20.6, 26.6, 23.6, 24.9, 21.0,…
Take a look at the relationship between each explanatory variable and Fertility, to tell us whether they are appropriate for the model. We shall use the cor() function to create a correlation matrix.
NOTE: if your data has missing values, use cor(data, use = "complete.obs").
cor(swiss) %>%
round(2)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.00 0.35 -0.65 -0.66 0.46
## Agriculture 0.35 1.00 -0.69 -0.64 0.40
## Examination -0.65 -0.69 1.00 0.70 -0.57
## Education -0.66 -0.64 0.70 1.00 -0.15
## Catholic 0.46 0.40 -0.57 -0.15 1.00
## Infant.Mortality 0.42 -0.06 -0.11 -0.10 0.18
## Infant.Mortality
## Fertility 0.42
## Agriculture -0.06
## Examination -0.11
## Education -0.10
## Catholic 0.18
## Infant.Mortality 1.00
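Multicollinearity among the predictors can also be quantified with variance inflation factors (VIF): regress each predictor on the others and compute 1/(1 - R²); values above about 5 are a warning sign. A minimal base-R sketch (vif_one is a hypothetical helper written for illustration; the car package's vif() does the same job):

```r
# VIF for one predictor: 1 / (1 - R^2) from regressing it on the rest
vif_one <- function(var, data) {
  r2 <- summary(lm(reformulate(setdiff(names(data), var), var), data))$r.squared
  1 / (1 - r2)
}
preds <- setdiff(names(swiss), "Fertility")
round(sapply(preds, vif_one, data = swiss[preds]), 2)
```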
A general guide to interpreting the magnitude of a correlation:
0.3-0.7 moderate correlation
0.7-1 highly correlated
In the results we can see that all of the possible explanatory variables are moderately correlated with Fertility. We can also see that Examination and Education have a high correlation (0.70), so we need a closer look at whether to include one or both in the final model.

#### AIC method
When comparing models, a lower AIC value indicates a better model. The key idea behind the AIC is to find a model that offers a good trade-off between having fewer parameters and fitting the data well, helping to avoid over-fitting and under-fitting. This method is most useful when dealing with large datasets and many potential explanatory variables. Let's use the AIC (via backward stepwise selection) to identify the best combination of explanatory variables to predict Fertility.
library(MASS)
names(swiss)
## [1] "Fertility" "Agriculture" "Examination" "Education"
## [5] "Catholic" "Infant.Mortality"
These are the variable names in the swiss dataset.
lm(Fertility~.,data = swiss) %>%
step(direction = "backward",trace = 0) %>%
summary()
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
The AIC method has excluded Examination, and the selected explanatory variables explain 69.93% of the variation in the output variable.
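We can confirm the selection by comparing the AIC of the full model with that of the reduced model; the reduced model should come out lower (better):

```r
full    <- lm(Fertility ~ ., data = swiss)
reduced <- lm(Fertility ~ . - Examination, data = swiss)
AIC(full, reduced)  # the reduced model has the smaller AIC
```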